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ABSTRACT 

Strongly lensed variable quasars can serve as precise cosmological probes, pro¬ 
vided that time delays between the image fluxes can be accurately measured. 
A number of methods have been proposed to address this problem. In this 
paper, we explore in detail a new approach based on kernel regression esti¬ 
mates, which is able to estimate a single time delay given several datasets for 
the same quasar. We develop realistic artificial data sets in order to carry out 
controlled experiments to test of performance of this new approach. We also 
test our method on real data from strongly lensed quasar Q0957+561 and 
compare our estimates against existing results. 

Key words: Gravitational lensing, quasars, Q0957+561A,B, Time series, 
kernel regression, statistical analysis, Multiobjective algorithms 


1 INTRODUCTION 


Time delays between images of strongly-lensed dis¬ 
tant variable sources can serve as a valuable tool 
for cosmography, providing an alternative to other 
tools, such as CMB measurements and distance mea- 
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because of the unknown intrinsic source variability, 
the limited observational cadence, and the measure¬ 
ment noise. A number of methods have been de¬ 
veloped to accurately estimate time del ays. These 
inclu d e the dispersion s p ectra method ( Pelt et ^ 
1 19981 : IVuissoz et al.l l2007l : ICourbin et ah 2011 ); the 
polynomial and cu r ve-fit ting methods | Vuissoz et al.l 
l2008l : lEulaers et al.l l2013ll : the free-knot spline, vari¬ 
ability of regression differences (based on Gaus¬ 
sian process regression), and dispe rsion minimisa¬ 
tion aTewes. Courbin fc Me^l^ [20lj) ; Gau ssian pro¬ 
cess modelling (e.g., iHoilath Kim fc Linderll20ldl : and 
the combined method bas ed on the PRH approach 
(iHirv. Olspert &: Peltll20lJ) . However, this remains an 
active area of research, especially in view of the upcom¬ 
ing surveys such as LSST which will provide unprece- 
dent ed data sets with strongly lensed distant quasars 
(e.g., Treu__et_ai] 2 013|) (and the recent m ock data chal¬ 
lenge Dobler et al. 1 120151: iLiao et al.ll2015r) . 


Some of the authors of the present work 
previously proposed a kernel-based method with 
variable width (K-V) for time delay estimation 
(iGuevas-Tello. Tifio fc RavchaudhurvI l2006l l . This was 
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combined with an evol utionary algorithm (EA) for pa¬ 
rameter optimisation il Cuevas-Tello et al.ll2009l l. How¬ 
ever, the c omputational time complexity of EA method 
is 0(n®) ll Cuevas- Telld I^OOTI I. This restriction makes 
it inadequate for hand ling long time series, e.g. 
ISchild fc Thoms'^ lll99lf l dat£0. This complexity is due 
to matrix inversion in kernel-based methods for weights 
estimation. Automatic methods for time delay estima¬ 
tion have been proposed to speed up algorithms in or¬ 
der to deal with l ong time series, based on Artificial 
Neural Networks llGonzalez-Grimaldo fc Cuevas-Tello 


tion h as been proposed dCuevas-Tello fc Perez- Gonzalej 

iml). 

The main contribution of the present paper is a new 
probabilistic method that is efficient, robust to observa¬ 
tional gaps, capable of directly incorporating measured 
noise levels reported for individual flux measurement, 
and able to estimate a single time delay given several 
datasets for the same quasar. We also carefully construct 
synthetic data sets within the framework of multiobjec¬ 
tive optimization to reproduce realistic flux variability, 
observational gaps, and noise levels. This allows us to 
test our proposed kernel regression estimate method on 
synthetic as well as real data, in order to measure the 
bias and variance of the method. 

The paper is organised as follows. Section ^ 
presents the Nadaraya-Watson estimator with known 
noise levels (henceforth NWE) and in |3] we extend 
it to a linear noise model with unknown noise (hence¬ 
forth NWE++). In section @ we discuss two previously 
proposed delay methods, cross correlation and disper¬ 
sion spectra, to compare to the new approach. Section 
35 ] shows the real datasets studied in this paper, and 
presents our procedure for generating synthetic data. 
The results are given in ^ and we conclude with a 
summary in 0 


2008) 


2012 ) 


these can be parallelized (iGuevas-Tello et al 
Alternatively, a simple hill-climbing optimisa- 


2 THE MODEL 

We consider a distant point source (e.g., a quasar) 
with two strongly lensed imagefl referred to as A 
and B, and one or more time series of flux measure¬ 
ments, possibly taken by different instruments and/or 
at different frequencies. The entire data collection D = 
{D^, consists of L data sets , (. € [1,L], 

each corresponding to a sequence of measurements 
taken with a given instrument and at a given frequency. 
Data sets consist of flux measurements of both im¬ 
ages, y\ and yg, taken at a non-uniform sequence of 
observational times times t\,t 2 , ■■■, ■ 

Formally, each set contains 3-tuples 


(fii VA,k,yB,k)^ fc — 1, 2, ..., N^, 


^ (^2^yA,2^yB,2)^ •••’ ^yA N^^yB 


where yA,k ^'iid ys.k denote the observed fluxes of image 
A and B, respectively, in at time t|,. We also assume 
that the standard errors k ag j, are known for 
each observation y^ k y^g.k^ respectively. 

The fluxes corresponding to the two images A and 
B are collected in sets 


Da = {(4,l/li),(4,l/l2),-,(fy^,l/i,iv0} 

and 


Db — {(Hi 2/fl,l)i (^21 J/S,2)> •■•I (^Ar<i 

For observations at frequencies above a few tens of 
MHz, dispersion yields sub-hour arrival time differences, 
and is not significant relative to typical time-delay mea¬ 
surement accuracy. We therefore assume that the time 
delay between gravitationally lensed fluxes does not de¬ 
pend on the wavelength at which the observations are 
taken. We also assume stationarity of the lensing ob¬ 
ject (e.g., a galaxy) in the sense that the delay does not 
change in time; in particular, we ignore micro-lensing 
contributions. 


2.1 Nadaraya-Watson Estimator with Known 
Noise Levels (NWE) 

Given a delay A, we seek to find a probabilistic model 
p(D|A) that explain^ D. Assuming independence of the 
observation sets we obtain 

L 

p{D\A)^l[p{D‘\A). 

e=i 

Assuming independent observations at distinct mea¬ 
surement times, we get 

n‘ 

p{d‘\a) = A) 

k=l 

and further assumption of independence of measure¬ 
ment noise in images A and B leads to 

p{yA,k,yB,k\tk, =PA(2 /A,fe|fi, A) pBiyi,k\ti,A). 

2.1.1 Modelling the source using image A 

It is typically assumed that the measurement uncertain¬ 
ties on fluxes and Dg are normally distributed, with 
zero mean Gaussian noise of known standard deviation 
k Emd Og j, associated with noisy observations Pa k 
and yg respectively. We model the mean of image 


^ http;//cfa-www.harvard.edu/~rschild/fulldata2.txt 
^ generalisation to four images is straightforward 


® We slightly abuse mathematical notation as we are ac¬ 
tually building conditional models of flux values, given the 
observation times. 
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A us in g Nadaraya-W atson kernel regression (iNadaraval 
Il964ll . llWatsonlfl963) . 


fA{t) = ^yA,k 




(1) 


where f^{t) is the predicted flux at time t and 
\ is a kernel positioned at with bandwidth 
parameter K. We use the Gaussian kernel 


K{t,tk\ h) = exp 


{t - tkY \ 

K^itk) j 


where the kernel scale K(tk) at position tk is defined 
as the distance spanned by the h neighbours (to the 
left and to the right) of tk, i.e. K{tk) = tk+h — tk-h- 
This approach to modelling the noise should work when 
the autocorrelation length of the observed flux is much 
longer than any gaps in the data during which the flux 
is modelled via the Nadaraya-Watson kernel regression 
estimator. If the autocorrelation length of the observed 
flux, which can be estimated from a time interval when 
the observations are relatively closely spaced, is compa¬ 
rable to or larger than a data gap, this approach (or any 
other approach that does not incorporate a physically 
accurate flux model) cannot be trusted. 

To respect the nature of gravitationally lensed data, 
we impose that the mean model for image B follows 
exactly that for image A, up to scaling by a constanlQ 
M > 0 and time shift by A: 

/i(f;A) = M/l(t-A). 


Since the shift A plays no role in modelling image 
A, we write 


where 


PA(j/A,fcl4) 


1 


\/27r ( 


exp 


f iiyik-fi«)r\ 

1 2 (4,,)2 I 


and 

PB(3/B,fel4> 


^ ^B,k 

'"‘’I 2 (4,j2 |- 


( 2 ) 

(3) 


(4) 


Note that given A, the only free parameter of 
P(2/A,fei J/B,fel4> ^) is the kernel width parameter in 
the formulation of the mean model ©• 

Ignoring constant terms and scaling, the negative 
log likelihood, — logp(D^|A), forms the approximation 
error for the set , 


sr 










B,k’ 


(5) 


^ assumed known, or easily estimated in a preprocessing 
stage using the means of the fluxes in and 


Writing down the negative log likelihood for the 
whole data, — logp(D|A), and ignoring scaling and con¬ 
stant terms leads to the total approximation error 

L 

EA{h-,A) = Y,Ei{h‘-,A), 

£^1 

where h = {h^, h? ,..., K) is a vector that collects kernel 
width parameters for all datasets , D ^,..., in D. 


2.1.2 Modelling the source using image B 


One can, of course, start by building a mean flux model 
fsii) for image B via Nadaraya-Watson kernel regres¬ 
sion. 


ft (f-, — . e 

/sW - z^yB,k fft 

fe—1 A/j—1 A^{b ) h ) 

imposing that the mean model of image A is 
/A(i;A) = ^ /i(t + A). 


(6) 


Crucially, since both images A and B come from the 
same source, we require that the kernel width for the 
mean models /a (I) and fsit) (and hence for /A(t;A) 
and /s(l; A) as well) be the same for the whole dataset 
D‘. 

Using the same reasoning as in section il2.1.1l we 
obtain an approximation error for the set D^\ 


EUh^-,A) = J2 

k = l 


iy‘ 


A,k 


s/i(4 + A))^ 


iyi,k-fU<)r\ 

J 


leading to the total approximation error 

L 

EB{h-,A) = Y,Eiih^-,A). 

t=i 


2.1.3 Estimating the Unique Time Delay across D 

Since there is no a-priori reason to prefer one image 
over the other, we aim to And the unique delay A that 
minimises both the errors i?A(h; A) and i?s(h; A) with 
the same ‘level of importance’. In other words, we are 
looking for A and the set of kernel width parameters 
h = {h^, h?, ...,h^), one for each dataset in D, that 
minimise the error 

A(h; A) = EA(h; A) -t £s{h; A). 

Note that the imposition that there is a unique 
delay A for the whole data D and that the kernel 
widths are the same throughout each set for all the 
corresponding mean models /K^), IbK), fAiU^) a^nd 
/^(f; A), not only makes sense from the point of view 
of underlying physics, but is also a stabilising factor in 
the analysis and modelling of D. 

The structure of our problem enables us to use an 
efhcient and practical approach to finding the optimal 
time delay A«. The error i5(h; A) to be minimised can 
be rewritten as 

L 

E(h; A) = A), (7) 

^=1 
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where 

e\P-,A) = E\{P-,A) + E^g{P-,A). 

For every test value A we can separately optimise 
A) for within each set . Note that this boils 
down into a set of L one-dimensional optimisations of 
bandwidths P. In addition, because of the 
nature of the mean models, the errors E^{h^-,A) will 
behave ‘reasonably’ with changes in P, i.e. the changes 
will be smooth and we can expect a roughly unimodal 
shape of cross-validated E^{h^;A). That enables us to 
use further speed-up tricks (such as halving) in the 1- 
dimensional optimisations. The estimated time delay is 
the one with the minimal overall F(h; A) for the (cross- 
validation) optimised kernel width parameters h. 


3 NADARAYA-WATSON ESTIMATOR 
WITH LINEAR NOISE MODEL 
(NWE++) 


In section |2] only the mean fluxes were modelled, the 
standard errors on observations were assumed known. 
Our approach can be extended to full probabilistic mod¬ 
elling by assuming a model for the relationship between 
the noise level and the observed fluxes. Here, we con¬ 
sider a simple model in which the standard error on 
the measured flux depends linearly on the observed flux 
value y, i.e., o(y) = v ■ y, where the proportionality 
constant u depends on the wavelength at which the 
flux is measured (e.g., u could be 1% and 0.1% for ra¬ 
dio and optical data, respectively). Assuming that the 
mean models for dataset are fitted reasonably well, 
so that yjk~ I G {A,B}, then to lowest order 

Avik )« > • fkti). 

Most of the material developed in sections [2] will 
stay unchanged, modifications are required only in the 
formulation of the noise models and 


PA(yA,k\^i) 


and 



exp 


-1 [ yA,k 

[fiiti) 



PB(yB,k\A^ 


1 

- A) 


• exp 


-1 r yi,k 

2{u^)^ [m/^(4-A) 



This time, however, we can write a full probabilis¬ 
tic model for any time point t and evaluate the like¬ 
lihood within our model given any observation pair 
{y^Add^y^Bp)) that could have been measured at time 
t: 


PAiyAd)) 


1 

fid) 


exp 


-1 

W 



( 10 ) 


and 

PB{yid)\^) = 


Mu^ fid - 


• exp 




vid) 


MfUt-A) 


- 1 


( 11 ) 


The approximation error Ei{h^;A) to be min 
imised by the choice of kernel width now reads: 

Ei(h‘-,A) ’ 




yA,k 




yB,k 


Mfiai - A) 


Following analogous arguments for the case of mod¬ 
elling the source using image B, we have 

PAivid)]^) = ^ 


fid + ^) 


■ exp 


-1 


' M yjit) 

/i(t + A) 


( 12 ) 


and 

PBiyid)) = 


fid) 


exp 


-1 


yid) 


fid) 


- 1 


which leads to the approximation error 


(•' ) fcl 


/ 

1 

2 

’ yB,k 

1 

[fidi + A) 


[fidl) \ 


Again, the final cost to be minimised is 

L 

E{h-,A) = J2EAh‘-,A), 
where 

E^P-, A) = Ei{P-, A) + Ei{P-, A). 


(13) 


4 PREVIOUS WORK 
4.1 Cross Correlation 

There are two versions of the methods based on cross 
correlation: the D i screte Correlation Function (DCF) 
(lEdelson fc Kroli^ 1 19881) and its variant, the Locally 
Normalised Discr ete Correlation Function (LNDCF) 
(iLehar et al.lll99^ ). Both calculate correlations directly 
on discrete pairs of light curves. These methods avoid 
interpolation in the observational gaps. They are also 
the simplest and quickest time delay estimation meth¬ 
ods. 

First, time differences (lags), Atij = tj—ti, between 
all pairs of observations are binned into discrete bins. 
Given a bin size Ar, the bin centred at lag r is the time 
interval 7 t = [r — Ar/2,T -|- Ar/2]. The DCF at lag r 
is given by 

" jyAdi) - o-)iyBdj) - b) 
i/(o'a - <^idi))Ai - ^%dj)) 

(14) 

where P(r) is the number of observational pairs in the 
bin centred at r, a and h are the means of the observed 


DCE(t) = 


P{t) 


E 
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Kernel regression time delay estimates 5 


data, yA{ti) and ysitj), and their variances are and 
respectively. 

Likewise, 

LNDCF(t) = 

P(r) 

, {VAiU) - a{T)){yB{tj) - 5(t)) 

i,j - (r\{ti)){al{T) - o-|(L)) 

(15) 

where air), b{T), (Ja('’~) and u^{r) are the lag means and 
variances in the bin centred at r. 

The time delay A is found when DCFij) and 
LNDCF{t), given by equations (11411 and (I15II. are 
greatest; i.e., at the bes t correlation l Edelson fc Kroli^ 
Il988l : iLehar et al.lll993 ). 


plots in Figure [T] The column labelled in Table [T] 
corresponds to the number of observations per dataset. 
The Data column shows whether the data are optical 
or radio and the Type column shows the filter and the 
frequency used to obtain the data. The Q0957+561 is a 
two-image quasar, so there is either an optical magni¬ 
tude offset or a flux ratio be tween images A and B. 
was provided by R. Schild (ISchild Sz ThomsonI Il997h . 
private communication. 

In order to consistently compare the performance of 
different time delay estimation methods in a controlled 
experimental setting, we also construct synthetic data 
on the basis of known gravitationally lensed fluxes In the 
optical and radio ranges, with the given observational 
noise and gaps structure. The “ground truth” - the delay 
- is imposed by us so that the statistics of different delay 
estimators can be consistently evaluated and compared. 


4.2 Dispersion Spectra 

The Dispersion Spectra method (IPelt et al.ll 19961 . 1 19981 ) 
measures the dispersion of time series of two light curves 
yA{ti) and ysitj) by combining them (given a trial 
time delay A and ratio M) into a single signal, y{tk), 
k = 1,2,...,2N. In other words, given the delay A, 
the observed values of signal A, {yA{ti)}k=i, and (de¬ 
layed and rescaled) signal B, {i/s(fi)}(Li, where yslt) = 
Mysit — ^), are joined together and re-ordered in time, 
forming a joint signal {y{tk)}k=i of length 2N. W e em¬ 
ploy two versions of this method l|Pelt et al .11 19981 ): 


DSfiA) 


E 


= min 
M 


2N-1 

a=l 


Wg {y(ta + l) - y{ta)f 


(16) 


and 


DS 2 a(A) =min 


Sg=, + 1 Ha,gWa,cGg,c ivitg) - y(n)f 
2 Ec=a + 1 Ha,cWa,cGg,A 


( 17 ) 


where 


1 


0-'^{ta + l) + 0-^{ta) 


, Wa,g = 


1 


0-2 (to) -I- cr'^itc) 


(18) 


are the statistical weights taking in account the mea¬ 
surement errors, where Go.c = 1 only when y{ta) and 
y{tc) are from different images, and Ga,c = 0 otherwise, 
and 


Hg,g = 


1 _ \ta ^tc\ , |ta-tc|s£<S 

0, otherwise. 


(19) 


Compared with DS\, the DS^a method has an ad¬ 
ditional parameter, the decorrelation length 5, which sig¬ 
nifies the maximum distance between observations that 
we ar e willing to consi der when calculating the correla¬ 
tions (iPelt et al.ll 19961) . 

The estimated time delay A is found by minimising DS^ 
over a range of time delay trials A, as above. 


5 DATA 

We employ six different datasets from the same quasar 
Q0957+561, L — 6. The details are in Table [T] and the 


5.1 Synthetic Data - Realistic Experimental 
Setting 


In this section we construct synthetic signals on which 
we will test the proposed and some of the existing ap¬ 
proaches to gravitational delay estimation in the pres¬ 
ence of observational noise and gaps. We constructed 
synthetic fluxes in the optic al range on the basis of 
(real r-band optical data of llSchild fc ThomsonI 1 19971) ) 
spanning roughly 10 and half years). In particular, we 
used to fit a distribution of possible fluxes ‘compat¬ 
ible’ with the data (formulated as a Gaussian process) 
and then sampled from this distribution synthetic fluxes 
of 3,500 observations. 

Gaussian process (GP) represents a distribution 
over functions 

f{t)~GP{y,At),K,^{t,t')), (20) 

with mean and covariance functions /igp(t) and 
Kgp{t,t'), respectively. Any sample from the GP 
corresponding to a finite set of observational times 
is Gaussian distributed with mean 
figp{t 2 ), • • ■ figp{tN) and covariance matrix 


( ^gpi^l ) ti) 
^gp (^2 ; ) 


^gp (^1 ? ^ 2 ) 
^gp (^2 ; ^2 ) 


^gp{t\ , tiv) ^ 
I^gp{ii2 ; tiV) 


\Kgp{tN ,ti) K gp{tN ^ 2 ) ■■■ Fgp{tjg ,t]g)/ 

( 21 ) 

For our purposes, we imposed zero mean (the mean of 
observations in was shifted to zero) and used the 
’squared exponential’ kernel function 


Kgp{t,t') = exp 


hip J 


( 22 ) 


with scale parameter hgp set using cross validation on 
D^. 

A vector (y,y«)^ of observations sampled at ob¬ 
servation times t and t« from the Gaussian process is 
distributed as 



( 23 ) 
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Q0957+561 at r-band (Schild) n=1232 



16.2 I-1-1-1-1-1-1-1 

4.5 4.6 4.7 4.8 4.9 5 5.1 

Julian day-2.4x10® 

(a) 


Q0957+561 at r-band (Ovaldsen) n=422 



Q0957+561 at r-band (Kundic) n=100 Q0957+561 at g-band (Kundic) n=97 




Q0957+561 at 6cm n=143 



■ Image A 
o Image B 




Julian day - 2.44x10® 

(e) 


Q0957+561 at 4cm n=58 



(f) D® 


Figure 1. Data set Q0957+561. Image A from is shifted up by 0.6 magnitudes for clarity; image A from is shifted up 
by 0.25 magnitudes; image A from is shifted up by 0.05 magnitudes. For more details on these datasets see Tabled 
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Kernel regression time delay estimates 7 


Table 1. Datasets: Q0957+561 


Id 

AT* 

Data 

Type 

Ratio/Offset 

Monitoring Range 

Ref 


1232 

optical 

r-band 

0.05 

16/11/1979 - 4/7/1998 

fSchild & Thomson 1997) 


422 

optical 

r-band 

0.076 

2/6/1992 - 8/4/1997 

fOvaldsen et al. 2003) 


100 

optical 

r-band 

0.21 

3/12/1994 - 6/7/1996 

fKundic et al. 1997) 


97 

optical 

g-band 

0.117 

3/12/1994 - 6/7/1996 

fKundic et al. 1997) 

D® 

143 

radio 

6cm 

1/1.43 

23/6/1979 - 6-Oct-1997 

(Haarsma et al. 1999) 


58 

radio 

4cm 

1/1.44 

4/10/1990 - 22/9/1997 

(Haarsma et al. 1999) 



Figure 2. Three Gaussian process posterior samples (dot¬ 
ted) based on (solid). Dashed curves signify ± 2 standard 
deviations. 


where Kgp, Kgpt and Kgpt* are kernel matrices corre¬ 
sponding to time instances t x t, t x t* and t* x t«, 
respectively. However, given observations y at times t, 
the conditional distribution of y* at times t* is given 
by 

p(y*|t*,y,t) = Ar(y*|p*,S*) (24) 

with 

h* = K]gp,K-^y (25) 

and 

S* = Kgp,, — K^p,Kgp Kgp,. (26) 

We sampled signals y* from the Gaussian process 
based on on a regular grid of 3,500 time stamps 
covering the temporal range of . As an example, we 
show three such signals in Figure [2l Dashed curves sig¬ 
nify ± 2 standard deviations. To create a pair of time 
shifted signals A and B, the smooth long signal (signal 
A) yA = y* was shifted in time by a delay A = 200 days 
to obtain signal B, 

yBit) = yA{t- A). (27) 

Finally, (as explained in greater detail in the following 
sections), we added observational noise independently 
to both signals A and B, and imposed observational 
gaps. 


5.1.1 Observational Noise 

Based on data, we first calculated the empirical dis¬ 
tribution p{p) of the ratio p of the reported flux levels 
yk and their associated standard errors ak'. Pk = ok/yk- 
For each observation y{t) in the synthetic stream we 
generated an additive observational noise from a zero 
mean Gaussian distribution with standard deviation 
o{t), where a{t) = p{t)y{t), with p{t) generated ran¬ 
domly i.i.d. from the empirical distribution p{p). 


5.1.2 Observational Gaps 

Real data are irregularly sampled due to practical 
considerations such as weather conditions, equipment 
availability, object visibility, etc. llEigenbrod et al]|2005l : 
I Cuevas- Telloll2007l) . Gaps in real data are characterised 
by two important quantities: gap size and gap position. 
The histogram in Figure [3Da) shows the empirical gap 
size distribution in . Shorter gaps of 1-5 days are 
more frequent than longer ones (more than 6 days). 

To make the synthetic data more realistic, we would 
like to respect constraints given by the gap size and 
inter-gap distance distributions for dominant gap sizes 
(up to 10 days). Gaps were imposed on the synthetic 
data by generating their sizes and positions through a 
multiobjective optimisation algorithm. The algorithm 
incorporated three constraints: (1) closeness of the gen¬ 
erated and empirical gap size distributions; (2) close¬ 
ness of the generated and empirical inter-gap interval 
distributions for gaps of 1-5 days; (3) closeness of the 
generated and empirical inter-gap interval distributions 
for gaps of 6-10 days. 


The particular algorithm we used was the 
computationally efficient R andom Weighte d Ge¬ 
netic Algorithm (R W GA) (iGho sh &: Dehuri| |2004|: 


Konak. Coit & SmithI 20061: 

Murata & Ishibuchil 

1995: Murata. Ishibuchi & Tanaka 19961: 

Zitzler. Deb & Thield l2000t). It 

uses a weighted 


average of normalised objectives for fitness assignment 
(for diversity imposition the weights are randomized). 
The procedure is outlined in Algorithm [T] 

The genome of each individual contains a sugges¬ 
tion for start positions and sizes of observational gaps. 
The design of individuals allows for a variable num¬ 
ber of gaps and ensures that the gaps are not overlap¬ 
ping. Figure [3] shows the results of applying the multi¬ 
objective genetic algorithm RWGA based on . Each 
objective corresponds to a row of two plots in Figure |21 
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Gap size (in days) 


(a) Objective 1 


Real data objective 2 



50 100 150 200 250 300 350 400 450 


Distance between Gap blocks <=5 
(c) Objective 2 



Distance between Gap blocks >5 


(e) Objective 3 


Synthetic data objective 1 



Gap size (in days) 
(b) Objective 1 


Synthetic data objective 2 



Distance between Gap blocks <=5 
(d) Objective 2 

Synthetic data objective 3 



Distance between Gap blocks >5 


(f) Objective 3 


Figure 3. Empirical distributions of gap size (a),(b), inter-gap interval for gaps of 1-5 days (c),(d) and inter-gap interval for 
gaps of 6-10 days (e),(f). Each objective of RWGA corresponds to a row of two plots, left and right plots showing empirical 
normalized histograms from the real (D^) and synthetic data, respectively. 


left and right plots showing empirical normalized his¬ 
tograms from the real and synthetic data, respectively. 

Generation of synthetic ’radio’ data proceeded in 
the same way as described in the previous section for 
optical data, this time based on data H®. 

5.2 Synthetic Data - Controlled Experimental 
Setting 

Generation of synthetic fluxes described above was mo¬ 
tivated by the desire to preserve realistic gap and noise 
distributions. We will refer to this approach as the ‘re¬ 


alistic’ experimental setting (RS). For comparing delay 
estimation algorithms in a large-scale controlled setting, 
we also considered an alternative specification of gap 
and noise distributions. The synthetic fluxes were first 
generated from the Gaussian Process model fitted to 
, as described in the previous section. The fluxes 
were then corrupted with observational gaps and noise. 
The gap sizes g were generated as realisations from a 
mixture distribution Puig) = ce ■ Psig', fig) + {1 — a) ■ 
Pu{g', Lg,Ug), where PB{g\ fig) is the Binomial distri¬ 
bution with mean fig and Pu{g; Lg,Ug) is the uniform 
distribution over [Lg,Ug]. We used the following set- 
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Algorithm 1 RWGA 

S = external archive to store non-dominated solutions 
found during the search so far; 

ns = number of elitist solutions immigrating from S to 
the population of potential solutions in each gener¬ 
ation L. 

Step 1: Generate a initial random population Xi, set 

L = 1. 

Step 2: Assign a fitness value to each individual solution 
X € At by performing the following steps: 

Step 2.1: Galculate the fitness Zo{x) for each objec¬ 
tive o = 1,... O. 

Step 2.2: Generate a random number in [0, 1] for 
each objective o = 1,... O 

Step 2.3: Calculate the random weight of each ob¬ 
jective o as Wo = :^ 

Step 2.4: Update the overall fitness of the solution 

X as r(x) = E°=i WoZo(x)- 

Step 3: Calculate the selection probability Ps{x) of each 
solution X £ At as follows: 

U(x) _ fzmin ’ 

where = min {F{x) I X G X,}. 

Step 4: Select parents using the selection probabilities 
calculated in Step 3. Mutate offspring with a predefined 
mutation rate. Copy all offspring to A^+i. 

Step 5: Randomly remove ns solutions from A^+i and 
add the same number of solutions from S to Ai+i. 

Step 6: If the stopping condition is not satisfied, set 
i = b -\- 1 and go to Step 2. Otherwise, return to S. 


tings: a = 0.95, pg = 4, 6 , 8 days, Lg = 20 and Ug = 80. 
The gap positions were randomised, subject to the con¬ 
straint of minimum inter-gap distance of 2 days. The 
allowed range for gap size was 1 to 80 days. For the ad¬ 
ditive Gaussian zero mean ‘observational’ noise we con¬ 
sidered three settings for the standard deviation: 0 . 1 %, 
0.2% and 0.3% of the flux level. We will refer to this 
approach as the ‘controlled’ experimental setting (CS). 


6 EXPERIMENTAL RESULTS 

We performed experiments on synthetic datasets de¬ 
scribed in section m as well as on real gravitation¬ 
ally lensed fluxes in the radio and optical ranges. In 
the experiments we compared our methods NWE and 
NWE-f-f, introduced in sections |2] and (3] respectively, 
with two dispersion spectra approaches, namely DSi 
and IIS' 2,4 and two cross correlation approaches DCF 
and LNDCF. 


6.1 Experiments on Synthetic Data 

As mentioned above, we set the ‘true’ time delay in the 
synthetic data to 200 days. The results of all approaches 


are based on testing time delay values in the range of 
175 to 225 days (1 day increment). 

It was found that the best setting for decorrela¬ 
tion length 3 in the DS' 2,4 method was 3 days. For 
NWE and NWE+-I- the kernel width h was estimated 
as variable kernel width with h = 2 neighbour^ For 
DCF and LNDCF, the bin size is set to 5 days, (see 
(ICuevas-Tello. Tino fc Ravchaudhurvll2006l D. 

For each method we show the mean (bias) p and 
standard deviation o of the maximum-likelihood delay 
estimates across experiments. In all plots, the true delay 
is represented by the horizontal line at /r = 200 . 

6.1.1 Realistic Experimental Setting (RS) 

For synthetic experiments in the realistic setting we gen¬ 
erated 500 base signals from the Gaussian process fit¬ 
ted to the optical data set , as described in section 
isu We then ran the RWGA algorithm to generate 
500 pareto front solutions for observational gap posi¬ 
tions and sizes fsee I5.1.2|l . Each base signal thus had a 
corresponding observational gap structure imposed on 
it. Finally, the signals were corrupted by observational 
noise (see 15.1. Ill . The same procedure was applied for 
generating 500 datasets in the radio range. 

Summary results for the RS experiments on the 500 
optical and radio data sets are presented in Tables[3and 
|3l respectively. We report the mean (/r) and standard 
deviation (a) of the delay estimates Ai, i = 1, 2,..., 500, 
the mean absolute error (MAE) of the delay estimates 
(MAE= X;®” |Ai - 2001/500), and the 95% Credibility 
Interval (Cl). The overall performance of the methods 
is also shown in Figure U) On smaller and noisier radio 
data the NWE is the best performing method, followed 
closely by NWEH—h. On optical data, the best perform¬ 
ing method is ^ 2 , 4 . It is important, however, to note 
that,in contrast to NWE methods, the dispersion spec¬ 
tra methods (DS) have parameters that are difficult to 
set objectively based on the given data only. In the ex¬ 
periments, we found the best DS parameter settings by 
imposing the true delay A = 200, which obviously bi¬ 
ases the DS results towards over-optimistic better per¬ 
formance levels. 

6.1.2 Controlled Experimental Setting (CS) 

For each setting of the Binomial gap distribution pg = 
4,6,8 days and for every noise level ratio from 0.1%, 
0.2%, 0.3% we generated 100 base signals from the un¬ 
derlying Gaussian process fitted on . We thus ob¬ 
tained 900 datasets. The length of the time series (after 
applying observational gaps) varied from 800 to 3000 
observations. 

An analogous procedure was used to generate 900 

® 2 neighbours came consistently as the favourite option 

when cross-validating the number of neighbours on several 
initial datasets. 
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Main results Main results 




Figure 4. RS results for optical and radio data 


Table 2. RS Results for optical range 


Method 

b 

-H 

MAE 

Cl range 

95% Cl 

NWE 

199.60±2.97 

2.19 

0.26 

[199.34,199.86] 

NWE+-I- 

199.83±3.23 

2.37 

0.28 

[199.55,200.11] 

DSf 

200.67±2.51 

1.05 

0.22 

[200.45,200.89] 

DSIa 

200.02±0.40 

0.16 

0.04 

[199.98,200.06] 

DCF 

199.14±13.77 

11.61 

1.21 

[197.93,200.35] 

LNDCF 

200.30±6.34 

4.47 

0.56 

[199.74,200.86] 


datasets in the radio range. For each setting of the Bi¬ 
nomial gap distribution Hg = 4, 6, 8 days and for every 
noise level ratio from 1%, 2%, 3% we generated 100 
base signals from the underlying Gaussian process fit¬ 
ted on H®. The overall results across all CS optical and 
radio datasets are summarized in Tables [4] and [5] re¬ 
spectively. Figures [5] and [6] present the results in greater 
detail, grouped by noise level and gap size. 

The kernel-based methods lead to more stable time 
delay estimates. NWE is the best performing method 
with respect to all performance measures, followed by 
NWE-f-f. It is interesting to note that while in general 
larger noise level ratio corresponds to larger standard 
deviation of the delay estimates, the DCE method seems 
to be more robust to increased noise levels. For low noise 
levels and with correlations between time-shifted data 
streams close to unity, the DCF method is, by construc¬ 
tion, relatively insensitive to the level of the noise. How¬ 
ever, it is still clearly outperformed by other techniques 
for the range of noise levels explored in this paper (see 
Figs. 4 and 5). 


6.2 Experiments on Real Data 

In this section, we present results of methods studied in 
this paper on real data - see Table [T] and Figure [T] Since 
for real data the noise levels related to observations are 
available, the NWEH—h method was not used. 

We have L = 6 datasets — D® and for all 


methods, we test values for time delay on the range of 
A = [400, 450] (increments of 1 day). The NWE cost to 
be minimised is i5(h; A) (eq. (0), with cross-validated 
kernel scale parameters h = (3, 2, 2, 2, 2, 2). 

For DCF and LNDCF, the bin size Ar was set to 
5, 5, 5, 5, 45, and 30 for , D^, D®, D'*', D®, and D®, 
respectively. As mentioned before, unlike in NWE, there 
is no objective way of setting such parameters based on 
the data only and we used the setting giving most robust 
results in the test range of delays 400-450 days. For a 
fixed delay A, the (LN)DCF function values at lag A 
are averaged across the 6 datasets — D® and the 
combined delay estimate is obtained at the maximum 
of the averaged (LN)DCF curve. 

For the Dispersion spectra method as argued 

above, the value of the decorrelation length parameter 
cannot be resolved in a principled manner based on the 
data and hence it was set to <5 = 3, since at this value 
DSi and DS^^a have more agreement. Again, for a fixed 
delay A, the DSi{A) and DS'| 4 (A) values are averaged 
across the 6 datasets and the combined delay estimate 
is obtained at the minimum of such averaged curves. 
The results (unique time delay across Q0957-1-561) are 
presented in Table |6l 


To measure the uncertainty of time delay estima¬ 
tions , following llHaars ma et afl 19991 : lOscoz et al ]| 19971. 
I2OOII : lOvaldsen et al.l 2003l l . we also performed Monte 
Carlo simulations by adding white noise generated ac- 
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Table 3. RS Results for radio range 


Method 

b 

-H 

MAE 

Cl range 

95% GI 

NWE 

199.47±3.71 

2.95 

0.32 

[199.15,199.791 

NWE++ 

200.37±4.31 

3.38 

0.38 

[199.99,200.75] 

DSl 

201.02±5.42 

4.42 

0.47 

[200.55,201.49] 

DSl^ 

204.20±9.98 

8.73 

0.87 

[203.33,205.07] 

DGF 

201.50±15.23 

13.10 

1.33 

[200.17,202.83] 

LNDGF 

199.94±5.83 

4.73 

0.51 

[199.43,200.45] 


Table 4. Overall CS Results across all observational gap and noise settings for optical range. 


Method 

b 

-H 

MAE 

Cl range 

95% GI 

NWE 

199.69±4.91 

3.76 

0.32 

[199.37,200.01] 

NWE++ 

199.69±5.78 

4.41 

0.38 

[199.31,200.07] 

DSl 

200.61±9.86 

7.62 

0.64 

[199.97,201.25] 

DSl^ 

199.97±14.10 

11.98 

0.92 

[199.05,200.89] 

DGF 

202.71±16.26 

14.22 

1.06 

[201.65,203.77] 

LNDGF 

200.63±10.56 

8.37 

0.69 

[199.94,201.32] 


Table 6. The unique time delay across Q0957+561 


Method 

(days) 

NWE 

420 

DSl 

435 

DSl^ 

435 

DGF 

408.78 

LNDGF 

426.31 


Table 7. Results of 500 Monte Carlo simulations: 
Q0957+561 


Method 

(d) 

a (d) 

NWE 

418.65 

0.49 

DSl 

434.98 

0.22 

DSl^ 

434.92 

1.08 

DGF 

408.77 

0.42 

LNDGF 

431.09 

15.04 


cording to the reported errors to each observatior0. For 
each data set we generated 500 randomized Monte Carlo 
realisations. The results (mean and std dev across the 
500 delay estimates) are presented in Table [T] 

Although we cannot compare these results against 
a known true value, it is apparent that time delay es¬ 
timates obtained with different methods are not mutu¬ 
ally consistent, unlike estimates on synthetic data. For 
example, DS^ and DCF estimates appear to lie more 
than 50 a apart. Moreover, we find that estimates using 
different frequency estimates on Q0957+561 data ap¬ 
pear to be inconsistent even when the same method is 
used. This suggests that the claimed measurement er¬ 
rors on the data are significantly under-estimated. Al- 

® Note that this effectively adds noise to already noisy ob¬ 
servations, resulting in a different noise distribution. For ex¬ 
ample, assuming the original noise is Gaussian, and adding 
random Gaussian noise from the same distribution, the stan¬ 
dard deviation of the noise distribution in this Monte Carlo 
data will be \/2 larger than the original one. 


ternatively, there may be unmodelled systematics (e.g., 
micro-lensing) that lead to varied biases for different 
analysis techniques. 


7 CONCLUSIONS 

We have introduced a new probabilistic efficient model- 
based methodology for estimating time delays between 
two gravitationally-lensed images of the same variable 
point source. The method enables one to use directly 
the noise levels reported for individual flux measure¬ 
ments. It is more robust to observational gaps than 
purely "unmodelled” techniques, since the imposition of 
an identical smooth model behind multiple lensed fluxes 
effectively regularizes the overall model fit, and conse¬ 
quently, the time delay estimate itself. Methods such as 
these will be useful in the automated search for time- 
delay systems as well as in the accurate measurement 
of delays in targeted systems in future very large time- 
domain surveys such as those planned for the Large Syn¬ 
optic Survey telescop e (LSST) (e.g. llHoiiati fc Lind^ 
|2015l;lLiao et al.ll201^1 1. 

The methods were tested and compared in two ex¬ 
perimental settings. In the realistic setting the synthetic 
data were generated so that multiple aspects of the real 
data were preserved: noise-to-observed flux ratio, obser¬ 
vational gap size distribution and the inter-gap interval 
distributions. The core synthetic signals were generated 
from a Gaussian process fitted to the real data. In the 
larger controlled experimental setting the signals gen¬ 
erated from the Gaussian process were subject to con¬ 
trolled levels of observational noise and gap sizes. Our 
method, while being computationally efficient, showed 
robustness with respect to noise levels and observational 
gap sizes. 

We also applied our method to real observed opti¬ 
cal and radio fluxes from quasar Q0957+561 as a com¬ 
bined dataset. Of course, with real data one can es¬ 
timate the variance of the estimator estimations, but 
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Figure 5. CS optical range results for NWE, NWE-|—t, DSf, DS22,4, DCF and LNDCF methods (plots (a), (b), (c), (d), (e) 
and (f), respectively) shown as functions of fig = 4,6,8 days (mean of the binomial gap size distribution) and observational 
noise level. In each case we present the mean and std dev of the delay estimates for the corresponding 100 data sets. 


never the bias, since the true time delay for Q0957-I-561 
is not known. Our NWE estimator on the combined op¬ 
tical and radio data suggests a delay of approximately 
420 days; however, we find that different estimators pro¬ 


duce inconsistent results, Indicating the presence of sta¬ 
tistical or systematic measurement errors in the data 
in excess of the claimed measurement uncertainty. In 
particular, the impact of microlensing corrections was 
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Table 5. Overall CS Results across all observational gap and noise settings for radio range. 


Method 

b 

-H 

MAE 

Cl range 

95% Cl 

NWE 

199.70±4.23 

3.24 

0.28 

[199.42,199.98] 

NWE++ 

199.89±5.07 

3.90 

0.33 

[199.56,200.22] 

Dl 

200.49±7.79 

5.92 

0.51 

[199.98,201.00] 

^4,2 

201.31±11.70 

9.36 

0.76 

[200.57,202.09] 

DCF 

201.13±15.70 

13.45 

1.03 

[200.10,202.16] 

LNDCF 

200.90±7.92 

5.96 

0.52 

[200.38,201.42] 


not accounted for in the present work, and needs to be 
quantified in the future. 
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Figure 6. CS radio rang results for NWE, NWEH— DS22,a-, DCF and LNDCF methods (plots (a), (b), (c), (d), (e) and 

(f), respectively) shown as functions of fig = 4,6,8 days (mean of the binomial gap size distribution) and observational noise 
level. In each case we present the mean and std dev of the delay estimates for the corresponding 100 data sets. 
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